The objective of this work is to analyze data from universities in the US by applying unsupervised learning techniques such as PCA, FA and clustering. Our intention is to separate observations, i.e. universities, according to their type (public/private) or even to how good the university is. This would give us a broader view and, hopefully, a better understanding of post-secondary education in the United States.

1 Data:

We consider a dataset from the website of the Integrated Postsecondary Education Data System (IPEDS). This dataset was also used in the USNEWS portion of the ASA Statistical Graphics Section's 1995 Data Analysis Exposition. It contains information on over 1000 American colleges and universities, and can be found at the following link: https://nces.ed.gov/ipeds/datacenter/InstitutionByGroup.aspx. The following variables have been considered:

Variable Information
FICE Federal ID number
X College name
State Postal code of the state
Public/private Indicator of public=1, private=2
Av_Math_SAT Average Math SAT score
Av_Verbal_SAT Average Verbal SAT score
Av_Comb_SAT Average Combined SAT score
Av_ACT_score Average ACT score
1Q_Math_SAT First quartile - Math SAT
3Q_Math_SAT Third quartile - Math SAT
1Q_Verbal_SAT First quartile - Verbal SAT
3Q_Verbal_SAT Third quartile - Verbal SAT
1Q_ACT First quartile - ACT
3Q_ACT Third quartile - ACT
Apps Number of applications received
Accepted Number of applicants accepted
Enrolled Number of new students enrolled
Top10perc Pct. new students from top 10% of H.S. class
Top25perc Pct. new students from top 25% of H.S. class
Full_Under Number of fulltime undergraduates
Part_Under Number of parttime undergraduates
In-state In-state tuition
Out-of-state Out-of-state tuition
Room_board Room and board costs
Room Room costs
Board Board costs
Fees Additional fees
Book Estimated book costs
Personal Estimated personal spending
Faculty_PhD Pct. of faculty with Ph.D.’s
Faculty_Terminal Pct. of faculty with terminal degree
SF_Ratio Student/faculty ratio
Perc_Donate Pct.alumni who donate
Instructional Instructional expenditure per student
Grad_Rate Graduation rate
data = read.csv("collegedata.csv", header = F, sep = ",", na.strings = "*")

2 Data Preprocessing:

Before starting with the models we need to explore and modify the dataset in order to obtain a clean input.

2.1 Libraries:

library(tidyverse)
library(leaflet)
library(rgdal)
library(stringr)
library(htmltab)
library(GGally)
library(factoextra)
library(fastDummies)
library(ggplot2)
library(cluster)
library(VIM)    # aggr(), for missing-value plots
library(mice)   # multiple imputation
library(ggpubr) # ggarrange()

2.2 Discard uninteresting variables:

Once the data are loaded, we rename the variables and discard those that are not of interest for the project:

colnames(data) = c("FICE","X","State","Public_Private","Av_Math_SAT","Av_Verbal_SAT","Av_Comb_SAT","Av_ACT_score","1Q_Math_SAT","3Q_Math_SAT","1Q_Verbal_SAT","3Q_Verbal_SAT","1Q_ACT","3Q_ACT","Apps","Accepted","Enrolled","Top10perc","Top25perc","Full_Under","Part_Under","In_State","Out_State","Room_Board","Room","Board","Fees","Books","Personal","Faculty_PhD","Faculty_Terminal","SF_Ratio","Perc_Donate","Instructional","Grad_Rate")
data = data %>% select(-c(1,5,6,7,8,9,10,11,12,13,14))

Now we check the types of the variables, in case any need modification. All variables have the right type except X and Public_Private, which should be factors; they will be converted later, in the Profiles section.

We also discard the Room and Board variables, because Room_Board is already their sum, and we delete Top10perc, since those students are by definition included in Top25perc. This avoids redundant correlations later.

str(data)
## 'data.frame':    1302 obs. of  24 variables:
##  $ X               : chr  "Alaska Pacific University" "University of Alaska at Fairbanks" "University of Alaska Southeast" "University of Alaska at Anchorage" ...
##  $ State           : chr  "AK" "AK" "AK" "AK" ...
##  $ Public_Private  : int  2 1 1 1 1 2 1 1 1 2 ...
##  $ Apps            : int  193 1852 146 2065 2817 345 1351 4639 7548 805 ...
##  $ Accepted        : int  146 1427 117 1598 1920 320 892 3272 6791 588 ...
##  $ Enrolled        : int  55 928 89 1162 984 179 570 1278 3070 287 ...
##  $ Top10perc       : int  16 NA 4 NA NA NA 18 NA 25 67 ...
##  $ Top25perc       : int  44 NA 24 NA NA 27 78 NA 57 88 ...
##  $ Full_Under      : int  249 3885 492 6209 3958 1367 2385 4051 16262 1376 ...
##  $ Part_Under      : int  869 4519 1849 10537 305 578 331 405 1716 207 ...
##  $ In_State        : int  7560 1742 1742 1742 1700 5600 2220 1500 2100 11660 ...
##  $ Out_State       : int  7560 5226 5226 5226 3400 5600 4440 3000 6300 11660 ...
##  $ Room_Board      : int  4120 3590 4764 5120 2550 3250 3030 1960 3933 4325 ...
##  $ Room            : int  1620 1800 2514 2600 1108 1550 NA 1960 NA 2050 ...
##  $ Board           : int  2500 1790 2250 2520 1442 1700 NA NA NA 2430 ...
##  $ Fees            : int  130 155 34 114 155 300 124 84 NA 120 ...
##  $ Books           : int  800 650 500 580 500 350 300 500 600 400 ...
##  $ Personal        : int  1500 2304 1162 1260 850 NA 600 NA 1908 900 ...
##  $ Faculty_PhD     : int  76 67 39 48 53 52 72 48 85 74 ...
##  $ Faculty_Terminal: int  72 NA 51 NA 53 56 72 53 91 79 ...
##  $ SF_Ratio        : num  11.9 10 9.5 13.7 14.3 32.8 18.9 18.7 16.7 14 ...
##  $ Perc_Donate     : int  2 8 NA 6 NA NA 8 NA 18 34 ...
##  $ Instructional   : int  10922 11935 9584 8046 7043 3971 5883 NA 6642 8649 ...
##  $ Grad_Rate       : int  15 NA 39 NA 40 55 51 15 69 72 ...
data$Room = NULL
data$Board = NULL
data$Top10perc = NULL

2.3 Missing Values, NAs:

We check whether the data have missing values (NAs) and, if so, where. First, we create a copy of data, full_data, in order to compare some values later.
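The copy mentioned above is not shown elsewhere in the code, so a minimal sketch of that step would be:

```r
# keep an untouched copy of the data (with its NAs) for later comparison:
full_data = data
```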

Next, using the VIM package we can plot the distribution of NAs and the proportion of missing values for each variable. In 4 variables the percentage of missing data is around 10-20%, while in the rest it is below 5%. We therefore set a threshold (for example, the value for Grad_Rate, since from there on the distribution is more stable) and remove the observations with NAs in the variables whose percentage of NAs exceeds that threshold.

# plot of NAs:
na_plot = aggr(data, col=c('#69b3a2','red'), numbers=TRUE, sortVars=TRUE, labels=names(data), cex.axis=.7, gap=3, ylab=c("Histogram of NAs","Pattern"))

## 
##  Variables sorted by number of missings: 
##          Variable       Count
##              Fees 0.210445469
##       Perc_Donate 0.170506912
##         Top25perc 0.155145929
##          Personal 0.139016897
##         Grad_Rate 0.075268817
##        Room_Board 0.058371736
##             Books 0.036866359
##     Instructional 0.029953917
##        Part_Under 0.024577573
##       Faculty_PhD 0.024577573
##          In_State 0.023041475
##  Faculty_Terminal 0.023041475
##         Out_State 0.015360983
##          Accepted 0.008448541
##              Apps 0.007680492
##          Enrolled 0.003840246
##        Full_Under 0.002304147
##          SF_Ratio 0.001536098
##                 X 0.000000000
##             State 0.000000000
##    Public_Private 0.000000000

These variables are Fees, Perc_Donate, Top25perc and Personal. However, Personal is not treated this way because, once the rows with NAs in Top25perc and Perc_Donate are deleted, its proportion of NAs falls below the threshold. It is also worth noting that Fees will later become part of a new variable, TotalCost, to which its contribution is small. For these reasons, we use MICE to estimate the remaining missing values; simply omitting the corresponding observations would mean losing information on many important universities.

# delete NAs of Top25perc and Perc_Donate:
data = data[!is.na(data$Top25perc),]
data = data[!is.na(data$Perc_Donate),]

# new plot of NAs:
na_plot = aggr(data, col=c('#69b3a2','red'), numbers=TRUE, sortVars=TRUE, labels=names(data), cex.axis=.7, gap=3, ylab=c("Histogram of NAs","Pattern"))

## 
##  Variables sorted by number of missings: 
##          Variable       Count
##              Fees 0.188087774
##          Personal 0.070010449
##        Room_Board 0.036572623
##         Grad_Rate 0.034482759
##        Part_Under 0.024033438
##          In_State 0.018808777
##       Faculty_PhD 0.017763845
##             Books 0.012539185
##  Faculty_Terminal 0.010449321
##         Out_State 0.009404389
##     Instructional 0.006269592
##              Apps 0.004179728
##          Accepted 0.002089864
##          Enrolled 0.001044932
##                 X 0.000000000
##             State 0.000000000
##    Public_Private 0.000000000
##         Top25perc 0.000000000
##        Full_Under 0.000000000
##          SF_Ratio 0.000000000
##       Perc_Donate 0.000000000

Before using MICE, we create a new variable, Acc_Rate = Accepted / Apps, which will be useful both to remove the correlation between Apps and Accepted and to catalog universities later. Creating it before imputation also prevents Acc_Rate from exceeding 1 because of imputed values of Apps and Accepted. Obviously, we then delete Apps and Accepted.

data$Acc_Rate = data$Accepted/data$Apps
data$Accepted = NULL
data$Apps = NULL

We have several options for handling these NAs but, as anticipated, we impute them using the Multiple Imputation technique (MICE), which better preserves the sample variability.

# model:
mice = mice(data, method = 'rf')
mice.com = mice::complete(mice)
# variables that have NAs:
data$Personal = mice.com$Personal
data$Grad_Rate = mice.com$Grad_Rate
data$Room_Board = mice.com$Room_Board
data$Part_Under = mice.com$Part_Under
data$Faculty_PhD = mice.com$Faculty_PhD
data$In_State = mice.com$In_State
data$Out_State = mice.com$Out_State
data$Instructional = mice.com$Instructional
data$Books = mice.com$Books
data$Faculty_Terminal = mice.com$Faculty_Terminal
data$Fees = mice.com$Fees
data$Enrolled = mice.com$Enrolled
data$Acc_Rate = mice.com$Acc_Rate

# plot without NAs
na_plot = aggr(data, col=c('#69b3a2','red'), numbers=TRUE, sortVars=TRUE, labels=names(data), cex.axis=.7, gap=3, ylab=c("Histogram of NAs","Pattern"))

## 
##  Variables sorted by number of missings: 
##          Variable Count
##                 X     0
##             State     0
##    Public_Private     0
##          Enrolled     0
##         Top25perc     0
##        Full_Under     0
##        Part_Under     0
##          In_State     0
##         Out_State     0
##        Room_Board     0
##              Fees     0
##             Books     0
##          Personal     0
##       Faculty_PhD     0
##  Faculty_Terminal     0
##          SF_Ratio     0
##       Perc_Donate     0
##     Instructional     0
##         Grad_Rate     0
##          Acc_Rate     0

A natural question to ask is: do the values imputed by MICE make sense? To answer it, we compare the values of Fees (this is why we previously created full_data), the variable with the highest percentage of NAs, with and without MICE.

# fees with MICE:
fees_mice = ggplot() + geom_boxplot(aes(x=data$Fees, color="red")) + theme(legend.position="none")

# fees without MICE:
fees_Nmice = ggplot() + geom_boxplot(aes(x=full_data$Fees)) +
    theme(legend.position="none")

# both plots:
comb_plot = ggarrange(fees_mice, fees_Nmice,
                    labels = c("Values of Fees with MICE", "Values of Fees in original dataset without MICE"),
                    ncol = 1, nrow = 2); comb_plot

We see that the distributions are similar, so the imputation with MICE is acceptable.
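Beyond the boxplots, a quick numeric check (a sketch, assuming full_data still holds the original Fees values with their NAs) compares summary statistics before and after imputation:

```r
# Fees before imputation (NAs present) vs. after MICE:
summary(full_data$Fees)
summary(data$Fees)
```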

2.4 Non-Reasonable values:

Let's take a broader look at the data using the summary command.

summary(data)
##       X                State           Public_Private     Enrolled     
##  Length:957         Length:957         Min.   :1.000   Min.   :  32.0  
##  Class :character   Class :character   1st Qu.:1.000   1st Qu.: 238.0  
##  Mode  :character   Mode  :character   Median :2.000   Median : 437.0  
##                                        Mean   :1.697   Mean   : 785.6  
##                                        3rd Qu.:2.000   3rd Qu.: 942.0  
##                                        Max.   :2.000   Max.   :6392.0  
##    Top25perc        Full_Under      Part_Under         In_State    
##  Min.   :  9.00   Min.   :  118   Min.   :    1.0   Min.   :  608  
##  1st Qu.: 39.00   1st Qu.:  978   1st Qu.:  106.0   1st Qu.: 3076  
##  Median : 52.00   Median : 1707   Median :  414.0   Median : 9200  
##  Mean   : 54.48   Mean   : 3747   Mean   :  953.1   Mean   : 8866  
##  3rd Qu.: 68.00   3rd Qu.: 4384   3rd Qu.: 1145.0   3rd Qu.:12660  
##  Max.   :100.00   Max.   :31643   Max.   :21836.0   Max.   :25750  
##    Out_State       Room_Board        Fees            Books       
##  Min.   : 2340   Min.   :1560   Min.   :  10.0   Min.   :  96.0  
##  1st Qu.: 6987   1st Qu.:3526   1st Qu.: 129.0   1st Qu.: 480.0  
##  Median : 9540   Median :4160   Median : 270.0   Median : 500.0  
##  Mean   :10172   Mean   :4323   Mean   : 389.6   Mean   : 549.4  
##  3rd Qu.:12669   3rd Qu.:5050   3rd Qu.: 480.0   3rd Qu.: 600.0  
##  Max.   :25750   Max.   :8700   Max.   :4374.0   Max.   :2340.0  
##     Personal     Faculty_PhD     Faculty_Terminal    SF_Ratio    
##  Min.   :  75   Min.   :  8.00   Min.   : 20.0    Min.   : 2.50  
##  1st Qu.: 875   1st Qu.: 61.00   1st Qu.: 68.0    1st Qu.:11.50  
##  Median :1230   Median : 74.00   Median : 81.0    Median :13.60  
##  Mean   :1370   Mean   : 71.78   Mean   : 78.5    Mean   :14.16  
##  3rd Qu.:1700   3rd Qu.: 85.00   3rd Qu.: 91.0    3rd Qu.:16.60  
##  Max.   :6900   Max.   :103.00   Max.   :100.0    Max.   :39.80  
##   Perc_Donate    Instructional     Grad_Rate         Acc_Rate      
##  Min.   : 0.00   Min.   : 3186   Min.   : 10.00   Min.   :0.09139  
##  1st Qu.:12.00   1st Qu.: 6591   1st Qu.: 52.00   1st Qu.:0.67283  
##  Median :20.00   Median : 8135   Median : 64.00   Median :0.77916  
##  Mean   :21.93   Mean   : 9590   Mean   : 64.11   Mean   :0.74508  
##  3rd Qu.:30.00   3rd Qu.:10779   3rd Qu.: 77.00   3rd Qu.:0.85019  
##  Max.   :64.00   Max.   :62469   Max.   :118.00   Max.   :1.00000

We see that Faculty_PhD and Grad_Rate have maximum values that exceed 100; since these variables are percentages, we discard those observations from the data.

data = data %>% filter(Faculty_PhD <= 100, Grad_Rate <= 100)

2.5 Duplicates:

Now that the data have reasonable values, we look for duplicated colleges and, if found, discard them. Note that deduplicating on the name alone also collapses distinct institutions that happen to share a name (e.g. the three entries named Trinity College), a simplification we accept:

# duplicated colleges:
table(data$X)[which(table(data$X)>1)]
## 
##        Augustana College          Bethany College           Bethel College 
##                        2                        2                        3 
##         Columbia College     Concordia University           Judson College 
##                        2                        2                        2 
##         Monmouth College     Northwestern College    Saint Francis College 
##                        2                        2                        2 
##   Saint Joseph's College          Trinity College            Union College 
##                        2                        3                        2 
## University of St. Thomas      Westminster College          Wheaton College 
##                        2                        2                        2
# cleaned data:
data = data %>% distinct(X, .keep_all = T)

2.6 Profiles:

The last thing to consider before starting with the models is to store some variables as profiles, so that we can interpret the models better later and keep the dataset numeric. Those variables are X, Public_Private and State. Moreover, we create a profile of how elite a college is:

# college names:
data$X = as.factor(data$X)
row.names(data) = data$X
college.names = data$X
data$X = NULL

# public/private:
college.pub_priv = data$Public_Private

# states:
college.states = data$State
data$State = NULL
par(mfrow = c(1, 2))
barplot(table(college.states), col="#69b3a2", main="Distribution of Universities per State")
barplot(table(college.pub_priv), col="#69b3a2", main="Distribution of Universities per Type", sub = "1: Public / 2: Private")

We consider as elite those universities with Top25perc >= 90 and Acc_Rate <= 0.40 (values that show up as outliers in the boxplots). This means that being accepted at such universities is very hard and exclusive, and that almost all of their students come from the top 25% of their high school class.

par(mfrow = c(1, 2))
boxplot(data$Acc_Rate, col="#69b3a2", main = "Acc_Rate boxplot")
boxplot(data$Top25perc, col="#69b3a2", main = "Top25perc boxplot")

# elite universities:
getElite = function(df){
  # 1 if very selective (Acc_Rate <= 0.40) and academically strong (Top25perc >= 90), else 0
  as.numeric(df$Acc_Rate <= 0.40 & df$Top25perc >= 90)
}
college.elite = getElite(data)
barplot(table(college.elite), col="#69b3a2", main="Elite universities", sub = "0: Not Elite / 1: Elite")

Also, we are going to consider another profile, Region, which groups states into US regions as established by the U.S. Census Bureau (https://www2.census.gov/geo/pdfs/maps-data/maps/reference/us_regdiv.pdf).

# regions:
groupUS = function(states){
  for (i in 1:length(states)){
    if (states[i] %in% c("CA", "OR", "WA", "NV", "ID", "MT", "WY", "UT", "CO", "AZ", "NM", "HI", "AK")){
      states[i] = "West"
    }
    
    if (states[i] %in% c("PA", "NJ", "NY", "VT", "RI", "MA", "NH", "ME", "CT")){
      states[i] = "NorthEast"
    }
    if (states[i] %in% c("TX", "OK", "AR", "LA", "MS", "AL", "TN", "KY", "WV", "DE", "MD", "DC", "VA", "NC", "SC", "GA", "FL")){
      states[i] = "South"
    }
    if (states[i] %in% c("IL", "IN", "MI", "OH", "WI", "IA", "KS", "MN", "MO", "NE", "ND", "SD")){
      states[i] = "North-Central"
    }
    
  }
  states
}
college.regions = groupUS(college.states)
barplot(table(college.regions), col="#69b3a2")

library(usmap)
# maps of regions:
west_plot = usmap::plot_usmap(include = .west_region, labels = TRUE, fill = "red", alpha = 0.25)
south_plot = usmap::plot_usmap(include = .south_region, labels = TRUE, fill = "orange", alpha = 0.25)
northcentral_plot = usmap::plot_usmap(include = .north_central_region, labels = TRUE, fill = "green", alpha = 0.25)
northeast_plot = usmap::plot_usmap(include = .northeast_region, labels = TRUE, fill = "blue", alpha = 0.25)

# combined regions plot:
regions_plot = ggarrange(west_plot, south_plot, northcentral_plot, northeast_plot,
                    labels = c("West", "South", "North-Central", "North-East"),
                    ncol = 4, nrow = 1, align = "v"); regions_plot

2.7 New variables:

It is interesting to consider the total cost (we will assume per semester), as is done in the following notebook: https://www.kaggle.com/lunamcbride24/college-exploration-and-regression. This will give us a better understanding later, when we apply unsupervised learning techniques. Note that Out_State and In_State tuition cannot both be paid at the same time, so we take their mean. As expected, the variables that form TotalCost are then removed.

  • Total Cost:
# new variable:
getTotalCost = function(df){
  # mean of in/out-of-state tuition, plus room & board, books, personal spending and fees
  (df$In_State + df$Out_State) / 2 +
    df$Room_Board + df$Books + df$Personal + df$Fees
}

data$TotalCost = getTotalCost(data)

# remove variables:
data$In_State = NULL
data$Out_State = NULL
data$Room_Board = NULL
data$Fees = NULL
data$Books = NULL
data$Personal = NULL

We also create another variable, Total_Under, the total number of undergraduates: full-time (Full_Under) plus part-time (Part_Under):

  • Total Undergraduates:
# new variable:
getTotalUnder = function(df){
  # full-time plus part-time undergraduates
  df$Full_Under + df$Part_Under
}

data$Total_Under = getTotalUnder(data)

# remove variables:
data$Full_Under = NULL
data$Part_Under = NULL

2.8 Outliers:

It is important to detect outliers properly, and box plots are well suited for this. We see that Acc_Rate, Enrolled, Instructional and Total_Under have many outliers. However, after several tests, the decision is to keep them: these colleges will be useful later for grouping and ranking universities.

data %>%
  as_tibble() %>%
  gather(variable, value) %>%
  ggplot(aes(x=value)) +
    geom_boxplot(fill= "#69b3a2") + facet_wrap(~variable, scale="free")

2.9 Public/Private variable:

Before creating the models, we create a copy of data without the Public/Private variable, in order to compare the outputs of the models.

# data without Public/Private:
out_data = data %>% select(-1)

3 Exploratory Data Analysis:

It is highly important to dedicate some time to exploring the variables in our dataset before starting with the models.

summary(data)
##  Public_Private     Enrolled        Top25perc      Faculty_PhD    
##  Min.   :1.000   Min.   :  32.0   Min.   :  9.0   Min.   :  8.00  
##  1st Qu.:1.000   1st Qu.: 239.2   1st Qu.: 39.0   1st Qu.: 61.00  
##  Median :2.000   Median : 443.5   Median : 52.5   Median : 74.00  
##  Mean   :1.692   Mean   : 796.1   Mean   : 54.6   Mean   : 71.91  
##  3rd Qu.:2.000   3rd Qu.: 959.5   3rd Qu.: 68.0   3rd Qu.: 85.00  
##  Max.   :2.000   Max.   :6392.0   Max.   :100.0   Max.   :100.00  
##  Faculty_Terminal    SF_Ratio      Perc_Donate    Instructional  
##  Min.   : 20.00   Min.   : 2.50   Min.   : 0.00   Min.   : 3186  
##  1st Qu.: 68.00   1st Qu.:11.50   1st Qu.:12.00   1st Qu.: 6568  
##  Median : 81.00   Median :13.65   Median :20.00   Median : 8135  
##  Mean   : 78.58   Mean   :14.18   Mean   :21.86   Mean   : 9595  
##  3rd Qu.: 91.00   3rd Qu.:16.68   3rd Qu.:30.00   3rd Qu.:10804  
##  Max.   :100.00   Max.   :39.80   Max.   :64.00   Max.   :62469  
##    Grad_Rate         Acc_Rate         TotalCost      Total_Under   
##  Min.   : 10.00   Min.   :0.09139   Min.   : 6260   Min.   :  142  
##  1st Qu.: 52.00   1st Qu.:0.67070   1st Qu.:13300   1st Qu.: 1230  
##  Median : 64.00   Median :0.77903   Median :15984   Median : 2240  
##  Mean   : 63.99   Mean   :0.74398   Mean   :16790   Mean   : 4767  
##  3rd Qu.: 77.00   3rd Qu.:0.84986   3rd Qu.:19869   3rd Qu.: 5633  
##  Max.   :100.00   Max.   :1.00000   Max.   :31560   Max.   :38338
data %>%
  as_tibble() %>%
  gather(variable, value) %>%
  ggplot(aes(x=value)) +
    geom_histogram(fill= "#69b3a2") + facet_wrap(~variable, scale="free")

We can see that the variability differs greatly across variables: it is low in Instructional and Enrolled but high in Grad_Rate and Top25perc, which suggests the latter could be good variables for ranking observations. We also see that the maximum of Acc_Rate is 1: surprisingly, some colleges accept every applicant.
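To back up that claim numerically, a small sketch computes the coefficient of variation (sd/mean) of each numeric variable; higher values indicate more relative spread:

```r
# coefficient of variation per numeric variable, largest first:
num_vars = data %>% select(where(is.numeric))
sort(sapply(num_vars, function(v) sd(v) / mean(v)), decreasing = TRUE)
```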

4 PCA:

Principal Component Analysis is an Unsupervised Learning tool that will let us reduce dimensionality of the variables.

4.1 Data with Public/Private:

We have seen previously that the variables are on very different scales (percentages, ratios, dollar amounts, etc.), so we scale the data. PCA then finds the directions of maximum variance, which we can use to rank observations.

boxplot(scale(data), las=2, col="#69b3a2")

4.1.1 Correlations:

It is important to inspect the correlation matrix in order to detect hidden correlations.

library(GGally)
ggcorr(data, label = T)

As expected, Faculty_Terminal and Faculty_PhD are strongly correlated, essentially because faculty with a PhD are also counted among faculty with a terminal degree. The same happens with Enrolled and Total_Under. For this reason, we drop Faculty_PhD and Enrolled. We also see some natural strong correlations, such as Instructional with TotalCost (if the college spends more per student, the total cost tends to be higher) and, negatively, Total_Under with Public/Private (public universities tend to have more students).
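As a numeric complement to the plot (a sketch, using a hand-picked cutoff of 0.7), we can list the variable pairs whose absolute correlation exceeds that threshold:

```r
# variable pairs with |correlation| > 0.7:
cm = cor(data %>% select(where(is.numeric)))
cm[upper.tri(cm, diag = TRUE)] = NA   # keep each pair only once
high = which(abs(cm) > 0.7, arr.ind = TRUE)
data.frame(var1 = rownames(cm)[high[, 1]],
           var2 = colnames(cm)[high[, 2]],
           cor  = round(cm[high], 2))
</imports>
```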

data$Enrolled = NULL
data$Faculty_PhD = NULL
ggcorr(data, label = T)

4.1.2 Model:

# model:
pca = prcomp(data, scale=T)
summary(pca)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.0162 1.4512 0.89580 0.86563 0.73314 0.69389 0.63434
## Proportion of Variance 0.4065 0.2106 0.08025 0.07493 0.05375 0.04815 0.04024
## Cumulative Proportion  0.4065 0.6171 0.69738 0.77231 0.82606 0.87421 0.91445
##                            PC8     PC9    PC10
## Standard deviation     0.59235 0.55129 0.44805
## Proportion of Variance 0.03509 0.03039 0.02008
## Cumulative Proportion  0.94953 0.97992 1.00000

Checking the screeplot, we notice that around 60% of the total variance is explained by the first two components. This will be sufficient for our purposes, so we keep only two components.

fviz_screeplot(pca, addlabels = TRUE, barfill = "#69b3a2", barcolor = "#69b3a2")

We can also plot the loadings and notice that there may be two differentiated clusters. But which are they? We might guess public and private universities; we will confirm this later.

library(ggfortify)
pca_plot = autoplot(pca, data = data,
         label = TRUE, label.size = 3, shape = FALSE,
         loadings = TRUE, loadings.colour = 'blue',
         loadings.label = TRUE, loadings.label.size = 5)

library(plotly)
plotly::ggplotly(pca_plot)

First component:

How good is a University

fviz_contrib(pca, choice = "var", axes = 1, fill = "#69b3a2", color = "#69b3a2")

A minus sign is included just to make the plot easier to interpret:

barplot(-pca$rotation[,1], las=2, col="#69b3a2")

In this plot the signs are interpreted as usual: the more negative the bar, the lower the value, and vice versa. SF_Ratio, Acc_Rate and Total_Under have the opposite sign to the rest. This matches intuition: a low SF_Ratio means teaching is more personal, since there are few students per teacher, and low values of Acc_Rate and Total_Under mean that getting accepted is very difficult and that the university has a small number of students.

In contrast, TotalCost, Instructional, Grad_Rate, Top25perc and the remaining variables take high values: the total cost of a semester is high, and so are the percentage of students from the top 25% of their class and the graduation rate.

Taking all of this into account, these are features associated with “better” universities, so this first PC lets us rank colleges by how good they are.

Top 25 Universities:

head(college.names[order(pca$x[,1])], 25)
##  [1] California Institute of Technology    Johns Hopkins University             
##  [3] Yale University                       Harvard University                   
##  [5] Dartmouth College                     Princeton University                 
##  [7] Washington University                 University of Chicago                
##  [9] Amherst College                       Duke University                      
## [11] Williams College                      Massachusetts Institute of Technology
## [13] Wake Forest University                Stanford University                  
## [15] Swarthmore College                    Cooper Union                         
## [17] Columbia University                   Middlebury College                   
## [19] University of Pennsylvania            Harvey Mudd College                  
## [21] Bowdoin College                       Brown University                     
## [23] Emory University                      Pomona College                       
## [25] Wellesley College                    
## 938 Levels: Abilene Christian University Adelphi University ... Youngstown State University

Bottom 25 Universities:

tail(college.names[order(pca$x[,1])], 25)
##  [1] Henderson State University             
##  [2] Northern Michigan University           
##  [3] Barber Scotia College                  
##  [4] Northwest Missouri State University    
##  [5] Chadron State College                  
##  [6] Wayne State College                    
##  [7] Youngstown State University            
##  [8] Western New Mexico University          
##  [9] Indiana Univ. Northwest                
## [10] Prairie View A. and M. University      
## [11] Indiana Wesleyan University            
## [12] University of Central Arkansas         
## [13] West Liberty State College             
## [14] West Texas A&M University              
## [15] Arkansas Tech University               
## [16] University of Southern Indiana         
## [17] University of Texas at Arlington       
## [18] East Central University                
## [19] Weber State University                 
## [20] University of Sci. and Arts of Oklahoma
## [21] Metropolitan State College             
## [22] Goldey Beacom College                  
## [23] Northeast Louisiana University         
## [24] Mesa State College                     
## [25] Black Hills State University           
## 938 Levels: Abilene Christian University Adelphi University ... Youngstown State University

Second component:

How demanding/popular and public is a University

fviz_contrib(pca, choice = "var", axes = 2, fill = "#69b3a2", color = "#69b3a2")

barplot(pca$rotation[,2], las=2, col="#69b3a2")

We see that Total_Under and Public/Private are the most important variables in this second PC; however, their signs differ. This tells us that the number of students in a college is crucial in this PC, and that Public/Private can be read as how public the university is.

For these reasons, this second PC indicates how demanding/popular and public a university is. So, now we can rank the universities by their second PC scores:

Most demanding Universities:

head(college.names[order(pca$x[,2])], 25)
##  [1] University of Texas at Austin             
##  [2] Pennsylvania State Univ. Main Campus      
##  [3] Texas A&M Univ. at College Station        
##  [4] University of Minnesota Twin Cities       
##  [5] University of California at Los Angeles   
##  [6] University of California at Berkeley      
##  [7] Rutgers at New Brunswick                  
##  [8] Ohio State University at Columbus         
##  [9] University of Florida                     
## [10] University of Central Florida             
## [11] University of Washington                  
## [12] University of South Florida               
## [13] University of Illinois - Urbana           
## [14] Arizona State University Main campus      
## [15] Florida State University                  
## [16] Michigan State University                 
## [17] Indiana University at Bloomington         
## [18] North Carolina State University at Raleigh
## [19] University of Houston-Main Campus         
## [20] University of Michigan at Ann Arbor       
## [21] University of Wisconsin at Madison        
## [22] Purdue University at West Lafayette       
## [23] University of California at San Diego     
## [24] San Diego State University                
## [25] University of California at Davis         
## 938 Levels: Abilene Christian University Adelphi University ... Youngstown State University

Least demanding Universities:

tail(college.names[order(pca$x[,2])], 25)
##  [1] Dana College                        Lees-McRae College                 
##  [3] St. Paul's College                  Wisconsin Lutheran College         
##  [5] Salem-Teikyo University             Teikyo Post University             
##  [7] Pacific Christian College           Molloy College                     
##  [9] Mount Mary College                  Brescia College                    
## [11] Adelphi University                  Mount Marty College                
## [13] Curry College                       College of St. Joseph              
## [15] Dominican College of Blauvelt       Alvernia College                   
## [17] Lasell College                      Siena Heights College              
## [19] Montana State University - Northern Barber Scotia College              
## [21] Marian College of Fond du Lac       New England College                
## [23] MidAmerica Nazarene College         Mount Saint Clare College          
## [25] Lourdes College                    
## 938 Levels: Abilene Christian University Adelphi University ... Youngstown State University

4.2 Data without Public/Private:

Having computed PCA with the Public/Private variable, we can now repeat the analysis without it:

4.2.1 Correlations:

It is important to inspect the correlation matrix in order to uncover hidden correlations.

library(GGally)
ggcorr(out_data, label = T)

As expected, Faculty_Terminal and Faculty_PhD are strongly correlated, essentially because faculty with a PhD also hold a terminal degree. The same happens with Enrolled and Total_Under. For this reason, we drop Faculty_PhD and Enrolled.

out_data$Enrolled = NULL
out_data$Faculty_PhD = NULL
ggcorr(out_data, label = T)

4.2.2 Model:

# model:
pca_out = prcomp(out_data, scale=T)
summary(pca_out)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     1.9653 1.2693 0.89578 0.86488 0.72732 0.68464 0.63098
## Proportion of Variance 0.4291 0.1790 0.08916 0.08311 0.05878 0.05208 0.04424
## Cumulative Proportion  0.4291 0.6081 0.69730 0.78041 0.83919 0.89127 0.93551
##                            PC8     PC9
## Standard deviation     0.57250 0.50266
## Proportion of Variance 0.03642 0.02807
## Cumulative Proportion  0.97193 1.00000
fviz_screeplot(pca_out, addlabels = TRUE, barfill = "#69b3a2", barcolor = "#69b3a2")

library(ggfortify)
pca_plot = autoplot(pca_out, data = out_data,
         label = TRUE, label.size = 3, shape = FALSE,
         loadings = TRUE, loadings.colour = 'blue',
         loadings.label = TRUE, loadings.label.size = 5)

library(plotly)
plotly::ggplotly(pca_plot)

We can see that the observations are not as separated as when the Public/Private variable was included. This is because Total_Under now carries more weight, and we saw in 2.7. that its variability is low, which indicates that the observations are very similar in this variable.

First component:

How **good** a University is

fviz_contrib(pca_out, choice = "var", axes = 1, fill = "#69b3a2", color = "#69b3a2")

barplot(pca_out$rotation[,1], las=2, col="#69b3a2")

We see that the interpretation of this first PC without Public/Private is essentially the same as with Public/Private (the sign of a principal component is arbitrary, so the ordering may be reversed). So, again we can rank the universities by their first PC scores:
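
The claim that the first PC keeps its meaning without Public/Private can be checked numerically (a small addition, assuming the pca and pca_out objects fitted above): a large absolute correlation between the two score vectors supports it, the sign itself being arbitrary in PCA.

```r
# close to +/-1 if both first PCs carry essentially the same information
cor(pca$x[, 1], pca_out$x[, 1])
```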

Bottom 25 Universities:

head(college.names[order(pca_out$x[,1])], 25)
##  [1] Black Hills State University           
##  [2] Goldey Beacom College                  
##  [3] Mesa State College                     
##  [4] Northeast Louisiana University         
##  [5] Barber Scotia College                  
##  [6] Indiana Wesleyan University            
##  [7] University of Sci. and Arts of Oklahoma
##  [8] Metropolitan State College             
##  [9] East Central University                
## [10] Friends University                     
## [11] Weber State University                 
## [12] West Liberty State College             
## [13] Voorhees College                       
## [14] University of Southern Indiana         
## [15] Arkansas Tech University               
## [16] Lee College                            
## [17] Brewton-Parker College                 
## [18] University of Texas at Arlington       
## [19] West Texas A&M University              
## [20] Western New Mexico University          
## [21] University of Central Arkansas         
## [22] Prairie View A. and M. University      
## [23] Wayne State College                    
## [24] Chadron State College                  
## [25] Indiana Univ. Northwest                
## 938 Levels: Abilene Christian University Adelphi University ... Youngstown State University

Top 25 Universities:

tail(college.names[order(pca_out$x[,1])], 25)
##  [1] Wellesley College                     Pomona College                       
##  [3] Emory University                      Harvey Mudd College                  
##  [5] Bowdoin College                       Brown University                     
##  [7] Middlebury College                    University of Pennsylvania           
##  [9] Columbia University                   Cooper Union                         
## [11] Swarthmore College                    Wake Forest University               
## [13] Stanford University                   Williams College                     
## [15] Massachusetts Institute of Technology Duke University                      
## [17] Amherst College                       University of Chicago                
## [19] Washington University                 Princeton University                 
## [21] Dartmouth College                     Harvard University                   
## [23] Yale University                       Johns Hopkins University             
## [25] California Institute of Technology   
## 938 Levels: Abilene Christian University Adelphi University ... Youngstown State University

Second component:

How **demanding/popular** a University is:

fviz_contrib(pca_out, choice = "var", axes = 2, fill = "#69b3a2", color = "#69b3a2")

barplot(pca_out$rotation[,2], las=2, col="#69b3a2")

We see that Total_Under is the most important variable in this second PC. This is a notable difference, since the Public/Private variable is no longer available. So this PC indicates how demanding or popular a university is.
Now we can rank the universities by their second PC scores:

Most demanding Universities:

head(college.names[order(pca_out$x[,2])], 25)
##  [1] University of Texas at Austin             
##  [2] Texas A&M Univ. at College Station        
##  [3] Pennsylvania State Univ. Main Campus      
##  [4] University of Minnesota Twin Cities       
##  [5] University of Central Florida             
##  [6] Arizona State University Main campus      
##  [7] Ohio State University at Columbus         
##  [8] Rutgers at New Brunswick                  
##  [9] University of California at Berkeley      
## [10] Florida State University                  
## [11] University of South Florida               
## [12] University of California at Los Angeles   
## [13] University of North Texas                 
## [14] University of Florida                     
## [15] University of Houston-Main Campus         
## [16] Brigham Young University at Provo         
## [17] Indiana University at Bloomington         
## [18] Michigan State University                 
## [19] University of Illinois - Urbana           
## [20] San Diego State University                
## [21] Purdue University at West Lafayette       
## [22] North Carolina State University at Raleigh
## [23] University of Washington                  
## [24] University of Texas at San Antonio        
## [25] Texas Tech University                     
## 938 Levels: Abilene Christian University Adelphi University ... Youngstown State University

Least demanding Universities:

tail(college.names[order(pca_out$x[,2])], 25)
##  [1] Wheelock College                    Mount Marty College                
##  [3] Alvernia College                    McPherson College                  
##  [5] Adelphi University                  MidAmerica Nazarene College        
##  [7] Salem-Teikyo University             Wisconsin Lutheran College         
##  [9] Mary Baldwin College                Molloy College                     
## [11] College of Mount St. Joseph         Mount Mary College                 
## [13] Siena Heights College               Marian College of Fond du Lac      
## [15] College of St. Joseph               Goshen College                     
## [17] Pine Manor College                  Mount Saint Clare College          
## [19] Curry College                       Lourdes College                    
## [21] Principia College                   Brescia College                    
## [23] New England College                 Lasell College                     
## [25] Montana State University - Northern
## 938 Levels: Abilene Christian University Adelphi University ... Youngstown State University

4.3 Conclusion:

Having fitted the different PCA models, we can compare them and select the configuration that best suits our goals. Both models are very similar in terms of the meaning of the principal components. In this case the preferred choice is:

Data with Public/Private (4.1.), because this variable allows us to rank universities by a quality indicator.

5 Factor Analysis:

Factor Analysis is an analytical tool whose focus is to explain the correlations between indicators through a small number of common factors, thereby reducing the dimension.

As with PCA, we will consider two approaches: with and without the Public/Private indicator. Moreover, we will fit models with two and three factors.

5.1 Data with Public/Private:

5.1.1 Two Factors:

  • Rotation = “varimax”, Scores=“Bartlett”:
f = 2
r = "varimax"
s = "Bartlett"
data_fa = factanal(data, factors = f, rotation=r, scores=s)
data_fa
## 
## Call:
## factanal(x = data, factors = f, scores = s, rotation = r)
## 
## Uniquenesses:
##   Public_Private        Top25perc Faculty_Terminal         SF_Ratio 
##            0.182            0.450            0.487            0.557 
##      Perc_Donate    Instructional        Grad_Rate         Acc_Rate 
##            0.632            0.394            0.571            0.715 
##        TotalCost      Total_Under 
##            0.257            0.444 
## 
## Loadings:
##                  Factor1 Factor2
## Public_Private            0.904 
## Top25perc         0.738         
## Faculty_Terminal  0.694  -0.180 
## SF_Ratio         -0.376  -0.549 
## Perc_Donate       0.411   0.446 
## Instructional     0.726   0.282 
## Grad_Rate         0.512   0.409 
## Acc_Rate         -0.532         
## TotalCost         0.695   0.510 
## Total_Under       0.254  -0.701 
## 
##                Factor1 Factor2
## SS loadings      2.956   2.356
## Proportion Var   0.296   0.236
## Cumulative Var   0.296   0.531
## 
## Test of the hypothesis that 2 factors are sufficient.
## The chi square statistic is 360.77 on 26 degrees of freedom.
## The p-value is 1.21e-60

We can see that the proportion of total variance explained is 53%. The first factor captures how good a university is, while the second factor explains its popularity.

  • Rotation = “none”, Scores = “Regression”:
f = 2
r = "none"
s = "regression"
data_fa = factanal(data, factors = f, rotation=r, scores=s)
data_fa
## 
## Call:
## factanal(x = data, factors = f, scores = s, rotation = r)
## 
## Uniquenesses:
##   Public_Private        Top25perc Faculty_Terminal         SF_Ratio 
##            0.182            0.450            0.487            0.557 
##      Perc_Donate    Instructional        Grad_Rate         Acc_Rate 
##            0.632            0.394            0.571            0.715 
##        TotalCost      Total_Under 
##            0.257            0.444 
## 
## Loadings:
##                  Factor1 Factor2
## Public_Private    0.717  -0.551 
## Top25perc         0.520   0.529 
## Faculty_Terminal  0.294   0.654 
## SF_Ratio         -0.664         
## Perc_Donate       0.605         
## Instructional     0.674   0.390 
## Grad_Rate         0.639   0.144 
## Acc_Rate         -0.304  -0.438 
## TotalCost         0.833   0.223 
## Total_Under      -0.388   0.637 
## 
##                Factor1 Factor2
## SS loadings      3.477   1.835
## Proportion Var   0.348   0.184
## Cumulative Var   0.348   0.531
## 
## Test of the hypothesis that 2 factors are sufficient.
## The chi square statistic is 360.77 on 26 degrees of freedom.
## The p-value is 1.21e-60

With this configuration we get exactly the same total variance explained.
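
This is expected: rotation redistributes variance across the factors but leaves the communalities, and hence the total variance explained, unchanged. A quick numerical check (a small addition, using the data_fa object fitted above):

```r
# communality = 1 - uniqueness = sum of squared loadings per variable;
# these values are identical for the rotated and unrotated solutions
round(rowSums(data_fa$loadings^2), 3)
round(1 - data_fa$uniquenesses, 3)   # should agree with the line above
```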

For this reason we will also consider three factors. But first, let's plot the two factors in 2D to see whether there are any hidden insights:

fa_plot = autoplot(data_fa, label = TRUE, label.size = 3, shape = FALSE,
         loadings = TRUE, loadings.label = TRUE, loadings.label.size  = 5)
plotly::ggplotly(fa_plot)

The plot of the Factor Analysis model is very interesting: we can see two differentiated groups which, according to the loadings, correspond to the Public_Private indicator. Let's check it:

fa_plot = autoplot(data_fa, label = TRUE, label.size = 3, shape = FALSE, colour = college.pub_priv, legend = TRUE,
         loadings = TRUE, loadings.label = TRUE, loadings.label.size  = 5)
plotly::ggplotly(fa_plot)

5.1.2 Three Factors:

Previously we saw that with two factors the model explained around 53% of the total variance (independently of the rotation and scores used). For this reason we now consider three factors to see whether the fit improves.

  • Rotation = “varimax”, Scores=“Bartlett”:
f = 3
r = "varimax"
s = "Bartlett"
data_fa = factanal(data, factors = f, rotation=r, scores=s)
data_fa
## 
## Call:
## factanal(x = data, factors = f, scores = s, rotation = r)
## 
## Uniquenesses:
##   Public_Private        Top25perc Faculty_Terminal         SF_Ratio 
##            0.171            0.419            0.457            0.515 
##      Perc_Donate    Instructional        Grad_Rate         Acc_Rate 
##            0.605            0.005            0.513            0.712 
##        TotalCost      Total_Under 
##            0.280            0.450 
## 
## Loadings:
##                  Factor1 Factor2 Factor3
## Public_Private            0.901         
## Top25perc         0.708           0.283 
## Faculty_Terminal  0.664  -0.251   0.196 
## SF_Ratio         -0.223  -0.487  -0.445 
## Perc_Donate       0.454   0.402   0.164 
## Instructional     0.401   0.149   0.901 
## Grad_Rate         0.581   0.362   0.138 
## Acc_Rate         -0.400   0.105  -0.341 
## TotalCost         0.622   0.424   0.391 
## Total_Under       0.183  -0.718         
## 
##                Factor1 Factor2 Factor3
## SS loadings      2.287   2.134   1.450
## Proportion Var   0.229   0.213   0.145
## Cumulative Var   0.229   0.442   0.587
## 
## Test of the hypothesis that 3 factors are sufficient.
## The chi square statistic is 172.22 on 18 degrees of freedom.
## The p-value is 3.31e-27

We can see that the proportion of total variance explained is 59%.

  • Rotation = “none”, Scores = “Regression”:
f = 3
r = "none"
s = "regression"
data_fa = factanal(data, factors = f, rotation=r, scores=s)
data_fa
## 
## Call:
## factanal(x = data, factors = f, scores = s, rotation = r)
## 
## Uniquenesses:
##   Public_Private        Top25perc Faculty_Terminal         SF_Ratio 
##            0.171            0.419            0.457            0.515 
##      Perc_Donate    Instructional        Grad_Rate         Acc_Rate 
##            0.605            0.005            0.513            0.712 
##        TotalCost      Total_Under 
##            0.280            0.450 
## 
## Loadings:
##                  Factor1 Factor2 Factor3
## Public_Private    0.260   0.872         
## Top25perc         0.546           0.530 
## Faculty_Terminal  0.410  -0.266   0.551 
## SF_Ratio         -0.570  -0.400         
## Perc_Donate       0.401   0.375   0.307 
## Instructional     0.997                 
## Grad_Rate         0.424   0.343   0.436 
## Acc_Rate         -0.454   0.158  -0.237 
## TotalCost         0.677   0.358   0.366 
## Total_Under              -0.704   0.232 
## 
##                Factor1 Factor2 Factor3
## SS loadings      2.858   1.899   1.115
## Proportion Var   0.286   0.190   0.111
## Cumulative Var   0.286   0.476   0.587
## 
## Test of the hypothesis that 3 factors are sufficient.
## The chi square statistic is 172.22 on 18 degrees of freedom.
## The p-value is 3.31e-27

The result is that 59% of the total variance is explained, slightly higher than with two factors, but not enough to conclude that the three-factor model is better.
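
The comparison can be condensed into a short loop (a sketch, reusing the data frame data from above; the scores method does not affect the variance explained, so it is omitted here):

```r
# cumulative proportion of variance explained by f-factor models
for (f in 2:3) {
  fa = factanal(data, factors = f, rotation = "varimax")
  cat(f, "factors:", round(sum(fa$loadings^2) / ncol(data), 3), "\n")
}
```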

5.2 Data without Public/Private:

Having fitted the models with the Public/Private variable, we can now do the same without it:

5.2.1 Two Factors:

  • Rotation = “varimax”, Scores=“Bartlett”:
f = 2
r = "varimax"
s = "Bartlett"
out_data_fa = factanal(out_data, factors = f, rotation=r, scores=s)
out_data_fa
## 
## Call:
## factanal(x = out_data, factors = f, scores = s, rotation = r)
## 
## Uniquenesses:
##        Top25perc Faculty_Terminal         SF_Ratio      Perc_Donate 
##            0.426            0.496            0.516            0.626 
##    Instructional        Grad_Rate         Acc_Rate        TotalCost 
##            0.381            0.580            0.723            0.293 
##      Total_Under 
##            0.439 
## 
## Loadings:
##                  Factor1 Factor2
## Top25perc         0.372   0.660 
## Faculty_Terminal  0.148   0.694 
## SF_Ratio         -0.692         
## Perc_Donate       0.586   0.173 
## Instructional     0.611   0.496 
## Grad_Rate         0.579   0.291 
## Acc_Rate         -0.224  -0.476 
## TotalCost         0.733   0.412 
## Total_Under      -0.520   0.539 
## 
##                Factor1 Factor2
## SS loadings      2.550   1.970
## Proportion Var   0.283   0.219
## Cumulative Var   0.283   0.502
## 
## Test of the hypothesis that 2 factors are sufficient.
## The chi square statistic is 308.44 on 19 degrees of freedom.
## The p-value is 3.71e-54

  • Rotation = “none”, Scores=“Bartlett”:
f = 2
r = "none"
s = "Bartlett"
out_data_fa = factanal(out_data, factors = f, rotation=r, scores=s)
out_data_fa
## 
## Call:
## factanal(x = out_data, factors = f, scores = s, rotation = r)
## 
## Uniquenesses:
##        Top25perc Faculty_Terminal         SF_Ratio      Perc_Donate 
##            0.426            0.496            0.516            0.626 
##    Instructional        Grad_Rate         Acc_Rate        TotalCost 
##            0.381            0.580            0.723            0.293 
##      Total_Under 
##            0.439 
## 
## Loadings:
##                  Factor1 Factor2
## Top25perc         0.695   0.303 
## Faculty_Terminal  0.536   0.465 
## SF_Ratio         -0.595   0.360 
## Perc_Donate       0.572  -0.215 
## Instructional     0.786         
## Grad_Rate         0.638  -0.117 
## Acc_Rate         -0.466  -0.245 
## TotalCost         0.833  -0.112 
## Total_Under               0.744 
## 
##                Factor1 Factor2
## SS loadings      3.395   1.124
## Proportion Var   0.377   0.125
## Cumulative Var   0.377   0.502
## 
## Test of the hypothesis that 2 factors are sufficient.
## The chi square statistic is 308.44 on 19 degrees of freedom.
## The p-value is 3.71e-54

Both models explain 50% of the variance with two factors. The first factor again captures how good a university is, while the second factor explains its popularity.

Let's see the plot:

fa_plot = autoplot(out_data_fa, label = TRUE, label.size = 3, shape = FALSE,
         loadings = TRUE, loadings.label = TRUE, loadings.label.size  = 5)
plotly::ggplotly(fa_plot)

The separation is not as clear as when the Public/Private variable was included. One remark, however: the observations form a roughly square-shaped cloud whose branches are determined by the most important variables in each factor, such as Total_Under and TotalCost.

5.3 Conclusion:

We have considered 2 variations of the data (with and without Public/Private), 2 numbers of factors (2 and 3) and 2 configurations of rotation and scores: 8 different models in total. However, the results were very similar in terms of total variance explained. For our purpose, the choice is:

Two factors with rotation = “varimax” and scores = “Bartlett” on the data with Public/Private, because the explained variance is 53% and the plot showed that colleges are grouped well according to the Public/Private indicator.

6 Clustering:

The final unsupervised learning tool that we will use in this project is clustering. The focus is to summarize the observations by grouping them into clusters. The goal is to maximize the distance between different clusters while minimizing it between observations within the same cluster.

We will consider different clustering approaches:

  • K-Means
  • Mahalanobis K-Means
  • Hierarchical
  • PAM
  • Kernel K-Means

First, we load some libraries that are crucial for clustering:

library(factoextra)
library(cluster)
library(mclust)

6.1 Data with Public/Private:

6.1.1 K-Means:

Because clustering relies on distances, it is advisable to scale the data first.

X_data = scale(data)

It is also important to determine the optimal number of clusters. We make some plots using the silhouette and WSS methods: we see that we could choose either 2 or 3 clusters (gap_stat is not considered because it is more time consuming and the results of the previous methods are acceptable).

fviz_nbclust(X_data, kmeans, method = "silhouette", nstart = 1000, linecolor = "#69b3a2")

fviz_nbclust(X_data, kmeans, method = "wss", nstart = 1000, linecolor = "#69b3a2")

Before starting with the models, let's look at the PCA plot to see how the observations will be placed and to interpret the possible clusters through the loadings:

library(ggfortify)
pca_plot = autoplot(pca, data = data,
         label = TRUE, label.size = 3, shape = FALSE,
         loadings = TRUE, loadings.colour = 'red',
         loadings.label = TRUE, loadings.label.size = 5)

library(plotly)
plotly::ggplotly(pca_plot)

6.1.1.1 Two clusters:

Above average universities vs Below average universities.

# K=2:
k = 2
km2 = kmeans(scale(data), centers = k, nstart = 1000)
fviz_cluster(km2, data = scale(data), geom = c("point"),ellipse.type = 'norm', pointsize=1, main = "K-Means clusplot with 2 clusters")+
  theme_minimal()+geom_text(label=row.names(data),hjust=0, vjust=0,size=2,check_overlap = T)+scale_fill_brewer(palette="Greens")

We can see two differentiated groups. According to the interpretation of the PCs in 4.1., the two clusters are separated mainly along the first component: further to the left indicates a college better than average, and further to the right the opposite.

Let's see how good the groups are:
As we can see in both plots, the groups can be considered good according to the silhouette widths. On the other hand, they are not particularly well balanced, but this seems acceptable.

barplot(table(km2$cluster), col="#69b3a2", main = "Number of colleges per cluster", sub = " Hypothesis: 1 Above Av. / 2 Below Av.")

d = dist(scale(data), method="euclidean")  # same scaled data used by kmeans
data_sil = silhouette(km2$cluster, d)
plot(data_sil, col=1:2, main="", border=NA)
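
If a single number is preferred over the plot, the average silhouette width can be extracted directly (a small addition, using the data_sil object just computed):

```r
# overall average silhouette width for the 2-cluster solution
mean(data_sil[, "sil_width"])
```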

6.1.1.2 Three clusters:

Above average universities vs Below average public universities vs Below average private universities.

#K=3:
k = 3
km3 = kmeans(scale(data), centers = k, nstart = 1000)
fviz_cluster(km3, data = scale(data), geom = c("point"),ellipse.type = 'norm', pointsize=1, main = "K-Means clusplot with 3 clusters")+
  theme_minimal()+geom_text(label=row.names(data),hjust=0, vjust=0,size=2,check_overlap = T)+scale_fill_brewer(palette="Greens")

We can see three differentiated groups. According to the interpretation of the PCs in 4.1., the reading is similar to the two-cluster case: further to the left implies a better college, and further up indicates private.

Let's see how good the groups are:
As we can see in both plots, the groups can again be considered good according to the silhouette widths. On the other hand, they are not particularly well balanced, but, as with two clusters, this seems acceptable.

barplot(table(km3$cluster), col="#69b3a2", main = "Number of colleges per cluster", sub = "Hypothesis: 1 Below Av. Private / 2 Below Av.Public / 3 Above Av")

d = dist(scale(data), method="euclidean")  # same scaled data used by kmeans
data_sil = silhouette(km3$cluster, d)
plot(data_sil, col=1:3, main="", border=NA)

6.1.2 Mahalanobis K-Means:

Now we can change one input of the K-Means model: the distance between observations. In this case we use the Mahalanobis distance. R has no specific library or function to apply it directly, so we construct it step by step:

S_x = cov(data) #  covariance matrix
inv_Sx = solve(S_x) # S_x^-1 (inverse of S_x)
eigen_Sx = eigen(inv_Sx)
vector_Sx <- eigen_Sx$vectors
Y = vector_Sx %*% diag(sqrt(eigen_Sx$values)) %*% t(vector_Sx) # inv_Sx^1/2
X_til = scale(data, scale = FALSE)
data_Sx = X_til %*% Y
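
As a quick sanity check (an optional addition, using the objects just computed), the transformed data should have identity covariance, so that Euclidean distances on data_Sx correspond to Mahalanobis distances on the original data:

```r
# after this whitening step the covariance matrix is (numerically) the identity
round(cov(data_Sx), 2)
```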

6.1.2.1 Two clusters:

Private universities vs Public universities

Let´s consider first two clusters:

k = 2
km2.mahalanobis = kmeans(data_Sx, centers=k, nstart=1000)

In the clusplot we see that the two clusters are well differentiated. Taking into account the interpretation of the PCs, we realize that these clusters actually represent private and public universities.

fviz_cluster(km2.mahalanobis, data, geom = c("point"),ellipse.type = 'norm', pointsize=1, main = "K-Means with Mahalanobis distance clusplot with 2 clusters")+
  theme_minimal()+geom_text(label=college.names,hjust=0, vjust=0,size=2,check_overlap = T)+scale_fill_brewer(palette="Greens")

This interpretation is borne out by the following plot: in cluster 1 most colleges are private, and in cluster 2 most are public:

as.data.frame(X_data) %>% mutate(cluster=factor(km2.mahalanobis$cluster), names=college.names, pub_priv=college.pub_priv) %>%
ggplot(aes(x=as.factor(pub_priv), fill=as.factor(pub_priv) )) + geom_bar( ) + facet_wrap(~cluster) + scale_fill_brewer(palette = "Greens") + labs(title = "Distribution of Private (1) and Public (2) by cluster", x = "", y = "", fill = "Type (1: Public / 2: Private)") 
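
A numeric cross-tabulation gives the same check more compactly (a small addition, reusing the km2.mahalanobis and college.pub_priv objects from above):

```r
# rows: cluster assignment; columns: type (1 = public, 2 = private)
table(cluster = km2.mahalanobis$cluster, type = college.pub_priv)
```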

6.1.2.2 Three clusters:

Above Av. Private universities vs Below Av. Private universities vs Public universities

Let´s consider now 3 clusters and compare the results with 2 clusters:

k = 3
km3.mahalanobis = kmeans(data_Sx, centers=k, nstart=1000)

In this case the interpretation is not as straightforward as with two clusters.

fviz_cluster(km3.mahalanobis, data, geom = c("point"),ellipse.type = 'norm', pointsize=1, main = "K-Means with Mahalanobis distance clusplot with 3 clusters")+
  theme_minimal()+geom_text(label=college.names,hjust=0, vjust=0,size=2,check_overlap = T)+scale_fill_brewer(palette="Greens")

However, if we recall the 3-cluster K-Means plot in 6.1.1., we can see that the result is similar but not identical. In this case we have:

adjustedRandIndex(km3.mahalanobis$cluster, km3$cluster)
## [1] 0.5567912

6.1.3 Hierarchical Clustering:

Another clustering tool, different from the ones we have already tried, is Hierarchical Clustering. The focus here is to build a hierarchy of the observations, represented in a dendrogram: close observations are grouped successively until a single group contains all of them. A downside of this tool is that large datasets are difficult to handle, both in running time and in the interpretation of the results. For this reason:

The code below is commented out to avoid long running times and unreadable plots.

Nevertheless, we consider different distances and linkage methods, with the optimal number of clusters k = 2. Note that phylogenetic trees are also not shown, for the same reason as the dendrograms.

# dis = "euclidean"
# met = "complete"
# d = dist(data, method = dis)
# hc = hclust(d, method = met)
# fviz_dend(hc, k = 2, cex = 0.6, rect = T, repel = T)

# dis = "manhattan"
# met = "single"
# d = dist(data, method = dis)
# hc = hclust(d, method = met)
# fviz_dend(hc, k = 2, cex = 0.6, rect = T, repel = T)

# dis = "canberra"
# met = "average"
# d = dist(data, method = dis)
# hc = hclust(d, method = met)
# fviz_dend(hc, k = 2, cex = 0.6, rect = T, repel = T)
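
If a numerical summary is enough, the tree can be cut without drawing it (a sketch, reusing the scaled data X_data from 6.1.1; ward.D2 is one reasonable linkage choice among those mentioned):

```r
d  = dist(X_data, method = "euclidean")
hc = hclust(d, method = "ward.D2")
table(cutree(hc, k = 2))   # cluster sizes, without plotting the dendrogram
```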

6.1.4 PAM:

Before starting with the models we need to determine the optimal number of clusters. We make some plots using the silhouette and WSS methods: we see that we could choose either 2 or 3 clusters (gap_stat is not considered because it is more time consuming and the results of the previous methods are acceptable).

fviz_nbclust(X_data, pam, method = "silhouette", linecolor = "#69b3a2")

fviz_nbclust(X_data, pam, method = "wss", linecolor = "#69b3a2")

6.1.4.1 Two clusters:

Private universities vs Public universities

data_pam2 = eclust(X_data, "pam", stand=TRUE, k=2, graph=F)

fviz_cluster(data_pam2, data = X_data, geom = c("point"), pointsize=1, main = "PAM clusplot with 2 clusters")+
  theme_minimal()+geom_text(label=college.names,hjust=0, vjust=0,size=2,check_overlap = T)+scale_fill_brewer(palette="Greens")

In PAM, with 2 clusters we have the same interpretation as in 6.1.2. (Mahalanobis with 2 clusters):

adjustedRandIndex(km2.mahalanobis$cluster, data_pam2$clustering)
## [1] 0.8301378

6.1.4.2 Three clusters:

Above average universities vs Below average public universities vs Below average private universities.

data_pam3 = eclust(X_data, "pam", stand=TRUE, k=3, graph=F)

fviz_cluster(data_pam3, data = X_data, geom = c("point"), pointsize=1, main = "PAM clusplot with 3 clusters")+
  theme_minimal()+geom_text(label=college.names,hjust=0, vjust=0,size=2,check_overlap = T)+scale_fill_brewer(palette="Greens")

With 3 clusters we also have the same interpretation as in 6.1.1. (K-Means with 3 clusters):

adjustedRandIndex(km3$cluster, data_pam3$clustering)
## [1] 0.6533104

6.1.5 Kernel K-Means:

Kernel K-Means is a tool that allows us to capture clusters that are not linearly separable in the input space.

library(kernlab)

6.1.5.1 Two clusters:

We can try different kernels to see which one best fits our data.

With two clusters and a Gaussian kernel, and taking into account the plot at the beginning of section 6, the interpretation could be:

Better-known universities vs less-known universities.

# model:
ker = "rbfdot"
data_ker = kkmeans(as.matrix(X_data), centers=2, kernel=ker)
## Using automatic sigma estimation (sigest) for RBF or laplace kernel
# plot:
obj.data_ker21 = list(data = X_data, cluster = data_ker@.Data)
fviz_cluster(obj.data_ker21, geom = c("point"), ellipse=F,pointsize=1)+
  theme_minimal()+geom_text(label=college.names,hjust=0, vjust=0,size=2,check_overlap = T)+scale_fill_brewer(palette="Greens")

Above average universities vs Below average universities.

# model:
ker = "polydot"
data_ker = kkmeans(as.matrix(X_data), centers=2, kernel=ker)
##  Setting default kernel parameters
# plot:
obj.data_ker22 = list(data = X_data, cluster = data_ker@.Data)
fviz_cluster(obj.data_ker22, geom = c("point"), ellipse=F,pointsize=1)+
  theme_minimal()+geom_text(label=college.names,hjust=0, vjust=0,size=2,check_overlap = T)+scale_fill_brewer(palette="Greens")

With two clusters and a polynomial kernel, the interpretation is similar to the one in 6.1.1. (K-Means with two clusters).

adjustedRandIndex(km2$cluster, obj.data_ker22$cluster)
## [1] 0.9451335

6.1.5.2 Three clusters:

With three clusters and a Gaussian kernel, and taking into account the plot at the beginning of section 6, the interpretation could be:

Well-known universities vs moderately known universities vs less-known universities.

# model:
ker = "rbfdot"
data_ker = kkmeans(as.matrix(X_data), centers=3, kernel=ker)
## Using automatic sigma estimation (sigest) for RBF or laplace kernel
# plot:
obj.data_ker31 = list(data = X_data, cluster = data_ker@.Data)
fviz_cluster(obj.data_ker31, geom = c("point"), ellipse=F,pointsize=1)+
  theme_minimal()+geom_text(label=college.names,hjust=0, vjust=0,size=2,check_overlap = T)+scale_fill_brewer(palette="Greens")

Above average universities vs Below average public universities vs Below average private universities.

# model:
ker = "laplacedot"
data_ker = kkmeans(as.matrix(X_data), centers=3, kernel=ker)
## Using automatic sigma estimation (sigest) for RBF or laplace kernel
# plot:
obj.data_ker32 = list(data = X_data, cluster = data_ker@.Data)
fviz_cluster(obj.data_ker32, geom = c("point"), ellipse=F,pointsize=1)+
  theme_minimal()+geom_text(label=college.names,hjust=0, vjust=0,size=2,check_overlap = T)+scale_fill_brewer(palette="Greens")

With three clusters and a Laplacian kernel we see that the interpretation is similar to the one in 6.1.1 (K-Means with three clusters).

adjustedRandIndex(km3$cluster, obj.data_ker32$cluster)
## [1] 0.5760837

6.2 Data without Public/Private:

Once we have computed the models considering the Public/Private variable, we can do the same without considering it:

6.2.1 K-Means:

Because clustering is a tool based on distances, it is advisable to scale the data.

X_out_data = scale(out_data)

It is also important to determine the optimal number of clusters. We make some plots using the silhouette and WSS methods to determine it. We see that we could choose 2 (gap_stat is not considered because it is more time-consuming and the results of the previous methods are acceptable).

fviz_nbclust(X_out_data, kmeans, method = "silhouette", nstart = 1000, linecolor = "#9E1030FF")

fviz_nbclust(X_out_data, kmeans, method = "wss", nstart = 1000, linecolor = "#9E1030FF")
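For reference, the WSS criterion used in the elbow plot above is the within-cluster sum of squares that K-Means minimizes:

```latex
WSS = \sum_{k=1}^{K} \sum_{x_i \in C_k} \|x_i - \mu_k\|^2
```

where $\mu_k$ is the centroid of cluster $C_k$; we look for the value of $K$ at which the curve bends (the "elbow").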

Before starting with the models, let's consider the PCA plot in order to see how the observations are going to be placed and to interpret, through the loadings, the meaning of the possible clusters:

library(ggfortify)
pca_plot = autoplot(pca_out, data = out_data,
         label = TRUE, label.size = 3, shape = FALSE,
         loadings = TRUE, loadings.colour = 'red',
         loadings.label = TRUE, loadings.label.size = 5)

library(plotly)
plotly::ggplotly(pca_plot)

Above average universities vs Below average universities.

# K=2:
k = 2
km2_out = kmeans(scale(out_data), center = k, nstart = 1000)
fviz_cluster(km2_out, data = scale(out_data), geom = c("point"),ellipse.type = 'norm', pointsize=1, main = "K-Means clusplot with 2 clusters")+
  theme_minimal()+geom_text(label=row.names(out_data),hjust=0, vjust=0,size=2,check_overlap = T)+scale_fill_brewer(palette="Purples")

We can see that we have two differentiated groups. According to the previous interpretation of the PCs in 4.1, the two clusters are mainly differentiated by the first component: further to the left indicates that the college is better than the average college, and the opposite holds to the right.

Let's see how good the groups are.
As we can see in both plots, the groups can be considered good taking into account the silhouette widths. On the other hand, they are not that well balanced, but this seems acceptable.

barplot(table(km2_out$cluster), col="#9E1030FF", main = "Number of colleges per cluster", sub = " Hypothesis: 1 Above Av. / 2 Below Av.")

d = dist(out_data, method="euclidean")  
out_data_sil = silhouette(km2_out$cluster, d)
plot(out_data_sil, col=1:2, main="", border=NA)

summary(out_data_sil) 
## Silhouette of 938 units in 2 clusters from silhouette.default(x = km2_out$cluster, dist = d) :
##  Cluster sizes and average silhouette widths:
##       303       635 
## 0.1618806 0.3719111 
## Individual silhouette widths:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.5005  0.2018  0.3716  0.3041  0.4811  0.5526
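For reference, the silhouette width reported above is, for each observation $i$,

```latex
s(i) = \frac{b(i)-a(i)}{\max\{a(i),\,b(i)\}}
```

where $a(i)$ is the average distance from $i$ to the members of its own cluster and $b(i)$ is the smallest average distance from $i$ to the members of another cluster; $s(i)$ ranges from $-1$ to $1$, with values near $1$ indicating well-clustered observations.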

6.2.2 Mahalanobis K-Means:

Now, we can change the distance used by the K-Means model: in this case, the Mahalanobis distance. In R there is no specific library or function to apply it directly, so we construct it step by step:

S_x_out = cov(out_data) #  covariance matrix
inv_Sx_out = solve(S_x_out) # S_x^-1 (inverse of S_x)
eigen_Sx_out = eigen(inv_Sx_out)
vector_Sx_out <- eigen_Sx_out$vectors
Y_out = vector_Sx_out %*% diag(sqrt(eigen_Sx_out$values)) %*% t(vector_Sx_out) # inv_Sx_out^1/2
X_til_out = scale(out_data, scale = FALSE)
out_data_Sx = X_til_out %*% Y_out
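The construction above is the usual whitening argument: since

```latex
d_M(x_i,x_j)^2 = (x_i-x_j)^\top S_x^{-1}(x_i-x_j) = \|S_x^{-1/2}(x_i-x_j)\|^2
```

running ordinary (Euclidean) K-Means on the transformed data $\tilde{X} S_x^{-1/2}$ is equivalent to running K-Means with the Mahalanobis distance on the original data.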

More popular universities vs Less popular universities.

Let's first consider two clusters:

k = 2
km2_out.mahalanobis = kmeans(out_data_Sx, centers=k, nstart=1000)

In the clusplot we see that the interpretation is not straightforward because the groups overlap. This model does not seem to work properly for our purposes. We can guess that the meaning is more popular vs less popular universities.

fviz_cluster(km2_out.mahalanobis, out_data, geom = c("point"),ellipse.type = 'norm', pointsize=1, main = "K-Means with Mahalanobis distance clusplot with 2 clusters")+
  theme_minimal()+geom_text(label=college.names,hjust=0, vjust=0,size=2,check_overlap = T)+scale_fill_brewer(palette="Purples")

6.2.3 Hierarchical Clustering:

Another clustering tool, different from the ones we have already tested, is Hierarchical Clustering. In this case, the focus is to give a hierarchy to the observations, represented in a dendrogram. Close observations are grouped successively until we get one group containing all the observations. A downside of this tool is that large datasets are difficult to manage, both in computation time and in the interpretation of the results. For this reason:

The code is commented out to avoid long computation times and visualization problems.

However, we consider different distances and linkage methods, taking the optimal number of clusters to be k = 2. Notice that phylogenetic trees are also not considered, for the same reason as the dendrograms.

# dis = "euclidean"
# met = "complete"
# d = dist(out_data, method = dis)
# hc = hclust(d, method = met)
# fviz_dend(hc, k = 2, cex = 0.6, rect = T, repel = T)
# dis = "manhattan"
# met = "single"
# d = dist(out_data, method = dis)
# hc = hclust(d, method = met)
# fviz_dend(hc, k = 2, cex = 0.6, rect = T, repel = T)
# dis = "canberra"
# met = "average"
# d = dist(out_data, method = dis)
# hc = hclust(d, method = met)
# fviz_dend(hc, k = 2, cex = 0.6, rect = T, repel = T)
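For reference, the three linkage methods in the (commented) code above merge the clusters $A$ and $B$ with the smallest between-cluster distance, defined respectively as:

```latex
d_{\text{complete}}(A,B)=\max_{a\in A,\,b\in B} d(a,b), \quad
d_{\text{single}}(A,B)=\min_{a\in A,\,b\in B} d(a,b), \quad
d_{\text{average}}(A,B)=\frac{1}{|A|\,|B|}\sum_{a\in A}\sum_{b\in B} d(a,b)
```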

6.2.4 PAM:

Before starting with the models we need to determine the optimal number of clusters. We make some plots using the silhouette and WSS methods to determine it. We see that we could choose 2 (gap_stat is not considered because it is more time-consuming and the results of the previous methods are acceptable).

fviz_nbclust(X_out_data, kmeans, method = "silhouette", nstart = 1000, linecolor = "#9E1030FF")

fviz_nbclust(X_out_data, kmeans, method = "wss", nstart = 1000, linecolor = "#9E1030FF")

Above average universities vs Below average universities

out_data_pam2 = eclust(X_out_data, "pam", stand=TRUE, k=2, graph=F)

fviz_cluster(out_data_pam2, data = X_out_data, geom = c("point"), pointsize=1, main = "PAM clusplot with 2 clusters")+
  theme_minimal()+geom_text(label=college.names,hjust=0, vjust=0,size=2,check_overlap = T)+scale_fill_brewer(palette="Purples")

6.2.5 Kernel K-Means:

Kernel K-Means is a tool that allows us to capture clusters that are not linearly separable in the input space.

library(kernlab)

We can try different kernels to see which one best fits our data:

With two clusters and a Gaussian kernel, and taking into account the plot at the beginning of Section 6, we see that the interpretation could be:

More known universities vs Less known universities.

# model:
ker = "rbfdot"
out_data_ker = kkmeans(as.matrix(X_out_data), centers=2, kernel=ker)
## Using automatic sigma estimation (sigest) for RBF or laplace kernel
# plot:
obj.out_data_ker21 = list(data = X_out_data, cluster = out_data_ker@.Data)
fviz_cluster(obj.out_data_ker21, geom = c("point"), ellipse=F,pointsize=1)+
  theme_minimal()+geom_text(label=college.names,hjust=0, vjust=0,size=2,check_overlap = T)+scale_fill_brewer(palette="Purples")

Above average universities vs Below average universities.

# model:
ker = "polydot"
out_data_ker = kkmeans(as.matrix(X_out_data), centers=2, kernel=ker)
##  Setting default kernel parameters
# plot:
obj.out_data_ker22 = list(data = X_out_data, cluster = out_data_ker@.Data)
fviz_cluster(obj.out_data_ker22, geom = c("point"), ellipse=F,pointsize=1)+
  theme_minimal()+geom_text(label=college.names,hjust=0, vjust=0,size=2,check_overlap = T)+scale_fill_brewer(palette="Purples")

6.3 Conclusion:

Computing the adjustedRandIndex gives us a tool to see whether the models built on the two datasets are similar or not. We can only compare the models with 2 clusters. The PAM and Kernel K-Means models have values close to 0 and the Mahalanobis model is around 50%, which tells us that adding the Public/Private variable is determinant for the outcome. However, for K-Means the value is about 63%, and that is because its interpretation was above/below average, which has nothing to do with whether the college is private or public.
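For reference, the adjusted Rand index compares two partitions by correcting the Rand index $R$ (the proportion of pairs of observations on which the two partitions agree) for chance:

```latex
ARI = \frac{R - \mathbb{E}[R]}{\max(R) - \mathbb{E}[R]}
```

so that $ARI = 1$ for identical partitions (up to relabeling of the clusters) and $ARI \approx 0$ for independent ones.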

adjustedRandIndex(km2$cluster, km2_out$cluster)
## [1] 0.6334568

K-Means 2 clusters

adjustedRandIndex(km2.mahalanobis$cluster, km2_out.mahalanobis$cluster)
## [1] 0.4952633

Mahalanobis 2 clusters

adjustedRandIndex(data_pam2$cluster, out_data_pam2$cluster)
## [1] 0.08283191

PAM 2 clusters

adjustedRandIndex(obj.data_ker21$cluster, obj.out_data_ker22$cluster)
## [1] 0.001535154

Kernel 2 clusters

7 Conclusions:

7.1 Selected models:

After computing the different models for each dataset (with and without Public/Private) and comparing them, we can select the ones that best fit our intention. In this case, we choose two models, one with two and one with three clusters. This is because with two clusters the Public/Private split is interesting, and with three clusters the difference according to a quality indicator (above/below average).

2 clusters: Private universities vs Public universities

k = 2
km2.mahalanobis = kmeans(data_Sx, centers=k, nstart=1000)

fviz_cluster(km2.mahalanobis, data, geom = c("point"),ellipse.type = 'norm', pointsize=1, main = "K-Means with Mahalanobis distance clusplot with 2 clusters")+
  theme_minimal()+geom_text(label=college.names,hjust=0, vjust=0,size=2,check_overlap = T)+scale_fill_brewer(palette="Greens")

3 clusters: Above average universities vs Below average public universities vs Below average private universities.

data_pam3 = eclust(X_data, "pam", stand=TRUE, k=3, graph=F)

fviz_cluster(data_pam3, data = X_data, geom = c("point"), pointsize=1, main = "PAM clusplot with 3 clusters")+
  theme_minimal()+geom_text(label=college.names,hjust=0, vjust=0,size=2,check_overlap = T)+scale_fill_brewer(palette="Greens")

df = data.frame(cluster=data_pam3$cluster, state=college.states) %>% group_by(state)  %>%
  summarise(X = mean(cluster))

plot_usmap(data=df, values = "X") + scale_fill_continuous(
    low = "white", high = "red", name = "Mean cluster")

7.2 Insights of models with Profiles:

After selecting the best models we can use the profiles defined in 2.5 to answer some natural questions:

Are the better universities the ones with a higher TotalCost?
This is a natural question whose answer should be affirmative, and in fact, it is:

p1 = data.frame(z1=pca$x[,1],z2=pca$x[,2]) %>% 
  ggplot(aes(z1,z2,label=college.names,color=data$TotalCost)) + geom_point(size=0) +
  labs(title="Universities map according to TotalCost", x="PC1", y="PC2") +
  theme_bw() + scale_color_gradient(low="lightblue", high="darkblue")+theme(legend.position="bottom") + geom_text(size=2, hjust=0.6, vjust=0, check_overlap = TRUE) 


p2 = fviz_cluster(data_pam3, data = X_data, geom = c("point"), pointsize=1, main = "PAM clusplot with 3 clusters")+
  theme_minimal()+geom_text(label=college.names,hjust=0, vjust=0,size=2,check_overlap = T)+scale_fill_brewer(palette="Greens")

ggarrange(p1, p2, ncol = 1, nrow = 2)

Are the better universities private? Comparing the plot of the selected K-Means model with the graph of the distribution of colleges according to Public/Private, we confirm that our initial guess is true:

p1 = data.frame(z1=pca$x[,1],z2=pca$x[,2]) %>% 
  ggplot(aes(z1,z2,label=college.names,color=college.pub_priv)) + geom_point(size=0) +
  labs(title="Universities map according to Public/Private", x="PC1", y="PC2") +
  theme_bw() + scale_color_gradient(low="lightblue", high="darkblue")+theme(legend.position="bottom") + geom_text(size=2, hjust=0.6, vjust=0, check_overlap = TRUE)

p2 = fviz_cluster(km2.mahalanobis, data, geom = c("point"),ellipse.type = 'norm', pointsize=1, main = "K-Means with Mahalanobis distance clusplot with 2 clusters")+
  theme_minimal()+geom_text(label=college.names,hjust=0, vjust=0,size=2,check_overlap = T)+scale_fill_brewer(palette="Greens")

ggarrange(p1, p2, ncol = 1, nrow = 2)

Do the better universities have a low acceptance rate?
We can compare the plot of the clustering model we selected with a plot of the distribution of colleges according to Acc_Rate. We see that, effectively, the acceptance rate, or in other words exclusivity, affects how good a college is:

p1 = data.frame(z1=pca$x[,1],z2=pca$x[,2]) %>% 
  ggplot(aes(z1,z2,label=college.names,color=data$Acc_Rate)) + geom_point(size=0) +
  labs(title="Universities map according to Acc_Rate", x="PC1", y="PC2") +
  theme_bw() + scale_color_gradient(low="yellow", high="darkblue")+theme(legend.position="bottom") + geom_text(size=2, hjust=0.6, vjust=0, check_overlap = TRUE) 

p2 = fviz_cluster(data_pam3, data = X_data, geom = c("point"), pointsize=1, main = "PAM clusplot with 3 clusters")+
  theme_minimal()+geom_text(label=college.names,hjust=0, vjust=0,size=2,check_overlap = T)+scale_fill_brewer(palette="Greens")

ggarrange(p1, p2, ncol = 1, nrow = 2)

Does the Region affect the position of the universities?

We cannot affirm that, because there is no clear pattern in general. However, we can say that better and private universities are more likely to be in the NorthEast region. This is natural, taking into account that the Ivy League is in that region (https://www.usnews.com/education/best-colleges/ivy-league-schools).

p1 = data.frame(z1=pca$x[,1],z2=pca$x[,2]) %>% 
  ggplot(aes(z1,z2,label=college.names,color=college.elite)) + geom_point(size=0) +
  labs(title="PCA", x="PC1", y="PC2") +
  theme_bw() + geom_text(size=2, hjust=0.6, vjust=0, check_overlap = TRUE)

p2 = data.frame(z1=pca$x[,1],z2=pca$x[,2]) %>% 
  ggplot(aes(z1,z2,label=college.states,color=college.regions)) + geom_point(size=0) +
  labs(title="Universities map according to Region", x="PC1", y="PC2") +
  theme_bw() + theme(legend.position="bottom") + geom_text(size=2, hjust=0.6, vjust=0, check_overlap = TRUE)

ggarrange(p1, p2, ncol = 1, nrow = 2)

We can also confirm it by looking at the US maps for TotalCost and Acc_Rate. We see that universities in the NorthEast region (Ivy League) and in the West (Caltech, Stanford, …) have a higher cost:

df = data.frame(cost=data$TotalCost, state=college.states) %>% group_by(state)  %>%
  summarise(X = median(cost))

p1 = plot_usmap(data=df, values = "X") + scale_fill_continuous(
    low = "white", high = "#CB454A", name = "Total Cost") + labs(title = "Distribution of universities in the US by Total Cost") 
df = data.frame(acc=data$Acc_Rate, state=college.states) %>% group_by(state)  %>%
  summarise(X = mean(acc))

p2 = plot_usmap(data=df, values = "X") + scale_fill_continuous(
    low = "#CB454A", high = "white", name = "Acc Rate") + labs(title = "Distribution of universities in the US by Acceptance Rate") 
ggarrange(p1, p2, ncol = 2, nrow = 1)

It is more or less clear that better universities are mainly located in the NorthEast and West regions, but let's confirm it with the previously defined Elite universities profile:

df = data.frame(acc=college.elite, state=college.states) %>% group_by(state)  %>%
  summarise(X = sum(acc))

plot_usmap(data=df, values = "X") + scale_fill_continuous(
    low = "white", high = "#CB454A", name = "Elite count")  + labs(title = "Distribution of Elite universities in the US")

We can conclude that, effectively, the better universities are in those regions.